Introduction: Topic Modeling and LDA
Topic modeling in Natural Language Processing (NLP) is a technique used to discover hidden themes or topics within a collection of text documents. It’s an unsupervised machine learning technique, meaning it doesn’t require predefined tags or training data that’s been previously classified by humans. Each discovered topic is expressed as a cluster of strongly related words.
NLP techniques, topic modeling among them, are used in applications such as chatbots, autocorrection, speech recognition, language translation, social media monitoring, hiring and recruitment, email filtering, and more. One popular algorithm for topic modeling is Latent Dirichlet Allocation (LDA).
What’s LDA?
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. The topic proportions of a document are assumed to have a Dirichlet prior. The topic-specific word distributions also have a Dirichlet prior.
LDA has these key assumptions:
- Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
- Documents are exchangeable.
Requirements for the data:
- Each document must represent a mixture of topics.
- Each word must be generated from a single topic.
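As a rough illustration of these assumptions, the generative story can be simulated in a few lines of Python. This is a minimal sketch using only the standard library; the function names, the symmetric hyperparameter values `alpha` and `eta`, and the toy sizes are assumptions made for illustration and do not come from any particular LDA library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_lda_corpus(n_docs, doc_len, n_topics, vocab_size,
                        alpha=0.5, eta=0.1, seed=0):
    """Simulate documents under the LDA assumptions (hypothetical helper)."""
    rng = random.Random(seed)
    # Each topic is a distribution over words, drawn from a Dirichlet prior.
    topics = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(n_topics)]
    corpus, proportions = [], []
    for _ in range(n_docs):
        # Each document is a random mixture over topics (Dirichlet prior).
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            # Each word is generated from a single topic drawn from theta.
            z = rng.choices(range(n_topics), weights=theta, k=1)[0]
            w = rng.choices(range(vocab_size), weights=topics[z], k=1)[0]
            words.append(w)
        corpus.append(words)
        proportions.append(theta)
    return corpus, proportions, topics


corpus, proportions, topics = generate_lda_corpus(
    n_docs=5, doc_len=8, n_topics=3, vocab_size=20
)
print(corpus[0])       # word ids of the first simulated document
print(proportions[0])  # its topic mixture
```

Fitting LDA runs this story in reverse: given only the word ids, it estimates the topic mixtures and the topic-word distributions.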
What fixes exist so far for short texts? What are their problems?
- Merge all short documents by the same user into a long document
- Use a single topic for each short document
- Per-short-text LDA (a.k.a. Twitter-LDA)
- Dirichlet-Multinomial Mixture (DMM)
- Learn topics from longer documents (e.g., news articles) and apply them to short texts
- Classify short texts utilizing neural networks
- The word mover’s distance (WMD)
- Word embeddings
- Clustering techniques
These methods have limitations, however. Several do not retain user information or the co-occurrence of words within the same short text. Learning topics from longer documents requires a long-text corpus that is compatible with the short texts, and pre-trained word embeddings may not reflect the specific vocabulary and semantic usage of words in the short texts.
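The loss of within-text co-occurrence is easiest to see for the first fix, merging a user's posts. The sketch below is a hypothetical illustration: the helper names and the three toy posts are invented here, and tokenization is a plain whitespace split.

```python
from collections import defaultdict
from itertools import combinations


def pool_by_user(posts):
    """Concatenate the tokens of every post by the same user into one long document."""
    pooled = defaultdict(list)
    for user, text in posts:
        pooled[user].extend(text.lower().split())
    return dict(pooled)


def cooccurring_pairs(tokens):
    """Unordered pairs of distinct words that appear together in one document."""
    return {tuple(sorted(pair)) for pair in combinations(set(tokens), 2)}


# Toy data (made up for this example): two posts by "ann", one by "bob".
posts = [
    ("ann", "rates rise again"),
    ("ann", "great goal tonight"),
    ("bob", "goal of the season"),
]

pooled = pool_by_user(posts)

# Pairs that truly co-occur inside one of ann's individual posts.
within_posts = set()
for user, text in posts:
    if user == "ann":
        within_posts |= cooccurring_pairs(text.lower().split())

# Pairs that co-occur once ann's posts are merged into one document.
after_pooling = cooccurring_pairs(pooled["ann"])

# Pairs created only by the merge, e.g. ("goal", "rates").
spurious = after_pooling - within_posts
print(sorted(spurious))
```

In this toy case, nine of the fifteen word pairs in the merged document never appeared together in any single post, so a model trained on the merged text can no longer tell which words were actually used together.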

Introducing the stLDA-C Model
The stLDA-C model was proposed by Tierney et al. in their paper “Author Clustering and Topic Estimation for Short Texts.” This model particularly aims to improve topic estimation in brief documents, such as social media posts, and incorporates the grouping of authors for more effective analysis.
stLDA-C features:
- Short-text LDA topic model with unsupervised clustering of the authors of short documents
- Fuses the clustering of both authors (users) and documents
- Hierarchical model capable of sharing information at multiple levels, leading to higher-quality estimates of per-author topic distributions, per-cluster topic distribution centers, and author cluster assignments.
The stLDA-C model is specifically designed to handle the sparsity of words in short texts by considering the additional structure provided by user clusters and potentially by integrating external information or employing different priors that are more suitable for short texts.
What’s new in the stLDA-C model?
To understand what’s new in the stLDA-C model, let’s first take a closer look at the traditional LDA model.
Quick summary of the traditional LDA notations:
W: Word
Z: Topic
LDA Input:
- M documents
- Each of these documents has N words
LDA Output:
- K topics (clusters of words)
- Φ distribution (document to topic distribution)
Compared with the traditional LDA, the stLDA-C model adds a layer of user clustering and a layer of hierarchical topic distributions. From the diagrams, we can see that the stLDA-C model introduces several changes and additions:
- The model considers \(G\) clusters of users, where \(G\) is a hyperparameter.
- \(G_u\) represents the assignment of each user to a specific cluster, governed by the \(\phi\) parameter.
- \(\alpha_g\) is the vector parameter of a Dirichlet distribution over topic choices for users in cluster \(g\).
- \(\phi\) represents the distribution over user clusters. It encodes the proportion of users in each group and forms a prior distribution for the cluster assignments \(G_u\). Traditional LDA has no concept of user clusters, so this parameter is specific to stLDA-C.
- \(\theta_u\): Because the model assumes that each document (tweet) is generated by a single topic, the consideration for the document-topic distribution is replaced by user-topic distribution. Each user-specific topic distribution \(\theta_u\) is a draw from \(Dir(\alpha_g)\), where \(g\) is the cluster assignment of user \(u\).
- \(Z_{ud}\) is the topic of each tweet \(d\) by user \(u\). \(Z_{ud}\) is a single draw from \(\theta_u\), and all words in tweet \(ud\) are sampled from the topic distribution over words, \(\beta_t\), where \(Z_{ud} = t\).
The generative process of the stLDA-C model is as follows:
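Reading the parameter descriptions above as sampling steps gives the following sketch. The word index \(n\) and the symbol \(W_{udn}\) (the \(n\)-th word of tweet \(d\) by user \(u\)) are notation assumed here for illustration, and any priors placed on \(\phi\), \(\alpha_g\), and \(\beta_t\) are omitted because the text above does not state them.

```latex
\begin{align*}
G_u &\sim \mathrm{Categorical}(\phi)
  && \text{cluster assignment of user } u \\
\theta_u \mid G_u = g &\sim \mathrm{Dir}(\alpha_g)
  && \text{topic distribution of user } u \\
Z_{ud} \mid \theta_u &\sim \mathrm{Categorical}(\theta_u)
  && \text{single topic of tweet } d \text{ by user } u \\
W_{udn} \mid Z_{ud} = t &\sim \mathrm{Categorical}(\beta_t)
  && \text{each word of tweet } ud
\end{align*}
```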
TL;DR
Very intimidating, right? Let’s break it down:
Three key takeaways from the stLDA-C model:
- User Clustering: stLDA-C clusters users by topic preferences, enhancing the analysis of datasets where authorship is significant.
- Hierarchical Topic Distributions: The model employs hierarchical priors for nuanced cluster-level and user-specific topic analysis.
- Integrated Topic-User and Word Analysis: stLDA-C combines topic-user dynamics with word co-occurrence for comprehensive short text analysis.
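To make these takeaways concrete, the generative story described earlier can be simulated with the Python standard library. This is a minimal sketch, not the authors' implementation: the function names, the toy values chosen for `phi`, `alphas`, and `betas`, and the fixed tweet length are all assumptions for illustration, and the parameters are treated as given rather than drawn from priors.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_stldac(n_users, tweets_per_user, tweet_len, phi, alphas, betas, seed=0):
    """Simulate users and tweets following the stLDA-C description (hypothetical helper)."""
    rng = random.Random(seed)
    n_topics = len(betas)
    vocab_size = len(betas[0])
    users = []
    for _ in range(n_users):
        # G_u: cluster assignment of the user, drawn from phi.
        g = rng.choices(range(len(phi)), weights=phi, k=1)[0]
        # theta_u: user-specific topic distribution, drawn from Dir(alpha_g).
        theta_u = sample_dirichlet(alphas[g], rng)
        tweets = []
        for _ in range(tweets_per_user):
            # Z_ud: a single topic for the whole tweet, drawn from theta_u.
            z = rng.choices(range(n_topics), weights=theta_u, k=1)[0]
            # All words of the tweet come from that topic's word distribution.
            words = rng.choices(range(vocab_size), weights=betas[z], k=tweet_len)
            tweets.append({"topic": z, "words": words})
        users.append({"cluster": g, "theta": theta_u, "tweets": tweets})
    return users


# Toy parameters (assumed): 2 user clusters, 3 topics, 6-word vocabulary.
phi = [0.5, 0.5]
alphas = [[5.0, 0.5, 0.5],   # cluster 0 favors topic 0
          [0.5, 0.5, 5.0]]   # cluster 1 favors topic 2
betas = [[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],   # topic 0 uses words 0 and 1
         [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],   # topic 1 uses words 2 and 3
         [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]]   # topic 2 uses words 4 and 5

users = generate_stldac(n_users=4, tweets_per_user=3, tweet_len=5,
                        phi=phi, alphas=alphas, betas=betas)
for user in users:
    print(user["cluster"], [tweet["topic"] for tweet in user["tweets"]])
```

Users in the same cluster share \(\alpha_g\), so their `theta` vectors tend to concentrate on the same topics, which is the signal the model relies on when clustering authors.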
